You are currently looking at version 1.1 of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the Jupyter Notebook FAQ course resource.
In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on this dataset from Kaggle.
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount`, which is the amount of the transaction.
The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.
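Because the target is coded as 0/1, the share of fraud rows is simply the mean of that column. The snippet below is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are hypothetical and only stand in for `fraud_data.csv`).

```python
import pandas as pd

# Hypothetical toy frame standing in for fraud_data.csv (values are made up).
toy = pd.DataFrame({
    "Amount": [12.5, 250.0, 3.2, 99.9, 40.0],
    "Class": [0, 0, 1, 0, 0],
})

# Class is 0/1, so its mean equals the fraction of rows labelled as fraud.
fraud_fraction = toy["Class"].mean()
print(fraud_fraction)
```

A small fraction here signals class imbalance, which is why accuracy alone can be a misleading metric for this kind of data.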
In [1]:
import numpy as np
import pandas as pd
In [2]:
def answer_one():
    # Your code here
    df = pd.read_csv('fraud_data.csv')
    return df['Class'].sum() / len(df['Class'])
In [3]:
# answer_one()
In [4]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split
df = pd.read_csv('fraud_data.csv')
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?
This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.
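To see why a majority-class baseline can score high accuracy and still be useless for fraud detection, here is a minimal sketch on synthetic labels. The arrays `X_toy` and `y_toy` are made up for illustration, and the classifier is scored on the same data it was fitted on purely to keep the example short.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, made-up labels: 95 negatives and 5 positives.
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.zeros((100, 1))  # the dummy classifier ignores the features

# Always predict the most frequent class seen during fit (here: 0).
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred = dummy.predict(X_toy)

# scikit-learn metrics take the true labels first, then the predictions.
acc = accuracy_score(y_toy, y_pred)
rec = recall_score(y_toy, y_pred)
print(acc, rec)
```

Accuracy matches the share of the majority class, while recall is zero because no positive instance is ever predicted.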
In [5]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import recall_score
    from sklearn.metrics import accuracy_score
    # Baseline that always predicts the majority class of the training data
    d = DummyClassifier(strategy='most_frequent')
    d.fit(X_train, y_train)
    y_p = d.predict(X_test)
    # Metrics expect (y_true, y_pred); swapping them changes the recall
    return (accuracy_score(y_test, y_p), recall_score(y_test, y_p))
In [6]:
# answer_two()
Out[6]:
In [41]:
def answer_three():
    from sklearn.metrics import recall_score, precision_score, accuracy_score
    from sklearn.svm import SVC
    scv = SVC()
    scv.fit(X_train, y_train)
    y_p = scv.predict(X_test)
    return (scv.score(X_test, y_test), recall_score(y_test, y_p), precision_score(y_test, y_p))
In [42]:
# answer_three()
Out[42]:
In [37]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    # Your code here
    scv = SVC(C=1e9, gamma=1e-07)
    scv.fit(X_train, y_train)
    scores = scv.decision_function(X_test)
    y_pred_with_threshold = scores > -220
    confusion = confusion_matrix(y_test, y_pred_with_threshold)
    return confusion
In [38]:
# answer_four()
Out[38]:
Train a logistic regression classifier with default parameters using `X_train` and `y_train`.
For the logistic regression classifier, create a precision-recall curve and a ROC curve using `y_test` and the probability estimates for `X_test` (probability it is fraud).
Looking at the precision-recall curve, what is the recall when the precision is `0.75`?
Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?
This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.
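The values asked for can be read off the plots, but they can also be looked up from the arrays that `precision_recall_curve` and `roc_curve` return. The following is a rough sketch on made-up scores (`y_true` and `y_score` are hypothetical). It takes the curve point whose precision is nearest the target, and for the ROC curve the highest true positive rate among points that do not exceed the target false positive rate; other conventions, such as interpolation, are equally defensible.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

# Made-up labels and classifier scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

# Recall at the curve point whose precision is closest to the target.
precision, recall, _ = precision_recall_curve(y_true, y_score)
target_precision = 0.75
idx = np.argmin(np.abs(precision - target_precision))
recall_at_precision = recall[idx]

# Best true positive rate among ROC points with FPR at or below the target.
fpr, tpr, _ = roc_curve(y_true, y_score)
target_fpr = 0.4
tpr_at_fpr = tpr[fpr <= target_fpr].max()

print(recall_at_precision, tpr_at_fpr)
```

On real data the target precision is rarely hit exactly, so it is worth printing `precision[idx]` as well to see how close the nearest point actually is.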
In [18]:
def draw_pr_curve():
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve
    import matplotlib.pyplot as plt
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_scores_lr = lr.decision_function(X_test)
    precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
    # Locate the curve point whose decision threshold is closest to zero
    closest_zero = np.argmin(np.abs(thresholds))
    closest_zero_p = precision[closest_zero]
    closest_zero_r = recall[closest_zero]
    plt.figure()
    plt.xlim([0.0, 1.01])
    plt.ylim([0.0, 1.01])
    plt.plot(precision, recall, label='Precision-Recall Curve')
    plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
    plt.xlabel('Precision', fontsize=16)
    plt.ylabel('Recall', fontsize=16)
    # Use the current axes; plt.axes() adds a new axes in recent matplotlib
    plt.gca().set_aspect('equal')
    plt.show()
# draw_pr_curve()
In [20]:
def draw_roc_curve():
    %matplotlib notebook
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc
    # y_scores_lr is local to draw_pr_curve, so recompute the scores here
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_scores_lr = lr.decision_function(X_test)
    fpr_lr, tpr_lr, _ = roc_curve(y_test, y_scores_lr)
    roc_auc_lr = auc(fpr_lr, tpr_lr)
    plt.figure()
    plt.xlim([-0.01, 1.00])
    plt.ylim([-0.01, 1.01])
    plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.title('ROC curve (fraud classifier)', fontsize=16)
    plt.legend(loc='lower right', fontsize=13)
    plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
    # Use the current axes; plt.axes() adds a new axes in recent matplotlib
    plt.gca().set_aspect('equal')
    plt.show()
# draw_roc_curve()
In [21]:
def answer_five():
    # Your code here
    return (0.8, 0.9)
Perform a grid search over the parameters listed below for a logistic regression classifier, using recall for scoring and the default 3-fold cross validation.
'penalty': ['l1', 'l2']
'C':[0.01, 0.1, 1, 10, 100]
From `.cv_results_`, create an array of the mean test scores of each parameter combination, i.e.

| | `l1` | `l2` |
|---|---|---|
| `0.01` | ? | ? |
| `0.1` | ? | ? |
| `1` | ? | ? |
| `10` | ? | ? |
| `100` | ? | ? |

This function should return a 5 by 2 numpy array with 10 floats.
Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array.
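Reshaping the flat `mean_test_score` array into the 5 by 2 layout above relies on the order in which the grid search enumerates parameter combinations. The sketch below uses `ParameterGrid` on the same grid to show that order without fitting any model; the `labels` array is only an illustrative stand-in for the scores.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid_values = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100]}

# ParameterGrid iterates over the keys in sorted order ('C' before 'penalty'),
# so C varies slowest and penalty varies fastest.
combos = list(ParameterGrid(grid_values))

# Stand-in for mean_test_score: one label per combination, in grid order.
labels = np.array([f"C={p['C']},{p['penalty']}" for p in combos])

# Rows correspond to C values, columns to penalties (l1, l2).
table = labels.reshape(5, 2)
print(table)
```

With this ordering, a row-major reshape to five rows and two columns lines up with the table layout: one row per `C` value and one column per penalty.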
In [36]:
def answer_six():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    # liblinear supports both l1 and l2 penalties; the default solver in
    # recent scikit-learn versions (lbfgs) does not support l1
    clf = LogisticRegression(solver='liblinear')
    grid_values = {'penalty': ['l1', 'l2'], 'C':[0.01, 0.1, 1, 10, 100]}
    grid_clf_acc = GridSearchCV(clf, param_grid = grid_values, scoring = 'recall', cv=3)
    grid_clf_acc.fit(X_train, y_train)
    # Rows are C values, columns are penalties (l1, l2)
    return grid_clf_acc.cv_results_['mean_test_score'].reshape(-1, 2)
In [34]:
# r = answer_six()
In [35]:
# r
Out[35]:
In [26]:
# r['mean_test_score']
Out[26]:
In [29]:
# r['mean_test_score'].reshape(-1, 2)
Out[29]:
In [32]:
# Use the following function to help visualize results from the grid search
def GridSearch_Heatmap(scores):
    %matplotlib notebook
    import seaborn as sns
    import matplotlib.pyplot as plt
    plt.figure()
    sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])
    plt.yticks(rotation=0);
# GridSearch_Heatmap(answer_six())